user study
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models
Sparse Autoencoders (SAEs) have recently gained attention as a means to improve the interpretability and steerability of Large Language Models (LLMs), both of which are essential for AI safety. In this work, we extend the application of SAEs to Vision-Language Models (VLMs), such as CLIP, and introduce a comprehensive framework for evaluating monosemanticity at the neuron level in visual representations. To ensure that our evaluation aligns with human perception, we propose a benchmark derived from a large-scale user study. Our experimental results reveal that SAEs trained on VLMs significantly enhance the monosemanticity of individual neurons, with sparsity and wide latents being the most influential factors. Further, we demonstrate that applying SAE interventions on CLIP's vision encoder directly steers the outputs of multimodal LLMs (e.g., LLaVA), without any modifications to the underlying language model. These findings emphasize the practicality and efficacy of SAEs as an unsupervised tool for enhancing both interpretability and control of VLMs.
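As a rough illustration of the mechanism the abstract refers to, the sketch below shows a sparse autoencoder forward pass and a simple latent intervention in NumPy. The abstract does not specify the architecture, so the ReLU encoder, the top-k sparsity rule, the layer sizes, and the names `encode`, `decode`, and `steer` are all assumptions made for illustration; the random weights stand in for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a "wide" SAE has more latents than input dimensions.
d_model, d_latent, k = 8, 32, 4

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations to sparse latents: ReLU, then keep at most k per row."""
    z = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Indices of all entries except the k largest in each row.
    drop = np.argsort(z, axis=-1)[..., :-k]
    np.put_along_axis(z, drop, 0.0, axis=-1)
    return z


def decode(z):
    """Reconstruct activations from latents."""
    return z @ W_dec + b_dec


def steer(x, latent_index, value):
    """Overwrite one latent with a fixed value, then decode."""
    z = encode(x)
    z[..., latent_index] = value
    return decode(z)


x = rng.normal(size=(3, d_model))       # stand-in for vision-encoder activations
z = encode(x)
print(np.count_nonzero(z, axis=-1))     # at most k active latents per row
print(steer(x, latent_index=5, value=2.0).shape)
```

In a steering setup of the kind the abstract describes, the output of `steer` would replace the original activations before they are passed on to the language model; how that hand-off is wired is not stated in the abstract.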
A.1 Qualitative Results of Bench
Figure 5: Word clouds of text prompts for the text-only generation (T2I) task (left) and the multimodal generation task (right).
Figure 5 visually summarizes the prominent semantic elements in the benchmark prompts for text-only (T2I) and multimodal generation tasks. The differentiation of the word clouds reflects task-specific features of MMGen-Bench, emphasizing spatial and descriptive details in T2I tasks, while multimodal tasks more frequently involve social and interactive scenarios.
Aspect       Objects  Relations  Attributes  Counting  Overall
Spearman ρ   0.469    0.909      0.601       0.839     0.699
As depicted in Figure 6, the distribution of aspect types differs notably between the text-only generation (T2I) and multi-modal generation tasks. In the T2I setting, "Objects" dominate with 38.3%, while "Attributes" and "Relations" also constitute substantial proportions (33.9% and 25.4%, respectively).
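The Spearman ρ values reported per aspect above are rank correlations. The excerpt does not say which two sets of scores are being compared, so the sketch below only illustrates how such a coefficient is computed in general: both inputs are converted to ranks (ties share the mean rank) and the Pearson correlation of the ranks is taken. The function names and the sample scores are hypothetical and are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (sx * sy)


# Hypothetical scores for five samples (illustrative only).
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_scores = [1, 3, 2, 4, 5]
print(spearman_rho(metric_scores, human_scores))  # approximately 0.9
```

In practice, a library routine such as `scipy.stats.spearmanr` would normally be used instead; the hand-written version is shown only to make the rank-then-correlate steps explicit.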
DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models
Understanding and explaining the behavior of machine learning models is essential for building transparent and trustworthy AI systems. We introduce DEXTER, a data-free framework that employs diffusion models and large language models to generate global, textual explanations of visual classifiers. DEXTER operates by optimizing text prompts to synthesize class-conditional images that strongly activate a target classifier. These synthetic samples are then used to elicit detailed natural language reports that describe class-specific decision patterns and biases. Unlike prior work, DEXTER enables natural language explanations of a classifier's decision process without access to training data or ground-truth labels. We demonstrate DEXTER's flexibility across three tasks--activation maximization, slice discovery and debiasing, and bias explanation--each illustrating its ability to uncover the internal mechanisms of visual classifiers. Quantitative and qualitative evaluations, including a user study, show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting.
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks
Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs.